Information theoretic approach to interactive learning 
Abstract. - The principles of statistical mechanics and information theory play an important 
role in learning and have inspired both theory and the design of numerous machine learning 
algorithms. The new aspect in this paper is a focus on integrating feedback from the learner. 
A quantitative approach to interactive learning and adaptive behavior is proposed, integrating 
model- and decision-making into one theoretical framework. This paper follows simple principles 
by requiring that the observer's world model and action policy should result in maximal predictive 
power at minimal complexity. Classes of optimal action policies and of optimal models are derived 
from an objective function that reflects this trade-off between prediction and complexity. The 
resulting optimal models then summarize, at different levels of abstraction, the process's causal 
organization in the presence of the learner's actions. A fundamental consequence of the proposed 
principle is that the learner's optimal action policies balance exploration and control as an emerging 
property. Interestingly, the explorative component is present in the absence of policy randomness, 
i.e. in the optimal deterministic behavior. This is a direct result of requiring maximal predictive 
power in the presence of feedback. 
Introduction. - The problem of learning a model, 
or model parameters, from observations obtained in ex- 
periments, appears throughout physics and the natural 
sciences as a whole. The statistical mechanics of learning 
have been discussed in many contexts [1,2], such as neu- 
ral networks, support vector machines [3,4], and unsuper- 
vised learning via compression [5]. The latter, information 
theoretic approach essentially views learning as lossy com- 
pression - data are summarized with respect to some rel- 
evant quantity [6] . This can be an average variance [5] , or 
any other measure of either distortion [7] or relevance [6]. 
Applied to time series data, one can show [8] that if pre- 
diction is relevant, then representations are found by this 
approach that constitute unique sufficient statistics [9] and 
which can be interpreted as underlying causal states [10] 
of the observed system. 

However, the role of the observer is not always a passive 
one, as is assumed in the large majority of work on learn- 
ing theory (see e.g. [11,12]). In many problems ranging 
from quantum mechanics, to neuroscience, to animal be- 
havior, the interactive coupling between the observer and 
the system that is being observed is crucial and has to be 
taken into account. 

In this paper, an information-theoretic approach to integrated 
model and decision making is proposed. As a first 
step towards a general theory of adaptive behavior, let us 
ask a simple question: If the goal of a learner is to have 
as much predictive power as possible, then what is the 
least complex action policy, and what is the least complex 
world model that achieve this goal? 

The ability to predict improves the performance of a 
learner across a large variety of specific behaviors, and is 
hence quite fundamental, increasing the survival chance of 
an autonomous agent, or an animal, and the success rate 
on tasks, independent of the specific nature of the task. 
Furthermore, a good model of the world must general- 
ize well (see, e.g., [12]) — in other words, the quality of the 
learner's world model can be judged by how well it predicts 
as-yet unseen data. For those reasons prediction is in gen- 
eral crucial for any adaptively behaving entity. Therefore, 
as a first step, we focus on prediction. To model animal 
behavior, other constraints, such as energy consumption, 
are clearly also relevant. 

The approach taken here is related to, but different from 
active learning (e.g., [13-16]) and optimal experiment de- 
sign, which has found countless applications in physics, 
chemistry, biology and medicine ([17]; for more recent 
reviews see, e.g., [18,19]). These approaches do not usually 
take feedback from the learner into account. Feedback 
is modeled more explicitly in reinforcement learning 
(RL) [20], but this approach is limited to specific inputs, 
assuming that the learner receives a reward signal. In con- 
trast to RL, we step back and ask about behavior that is 
optimal with respect to learning about the environment 
rather than with respect to fulfilling a specific task. Our 
approach does not require rewards. 

Much of the RL literature assumes that the learner's ex- 
plorative behavior is achieved by some level of randomness 
of the behavioral policy [21]. Here we show, in contrast, 
that if learning and optimal model-making are the goal, 
then explorative behavior emerges as one component of 
the optimal policy - even in the absence of stochasticity: 
any policy which is optimal with respect to learning max- 
imally predictive models must balance exploration with 
control, including the optimal deterministic policy (see Eq. 
(21)). 

Conceptually, our approach could perhaps be thought 
of as "rewarding" information gain and, hence, curiosity. 
In that sense, it is related to curiosity driven RL [22], 
where internal rewards are given that correlate with some 
measure of prediction error. 1 However, an important dif- 
ference of the approach discussed here is that the learner's 
goal is not to predict future rewards, but rather to behave 
such that the time series it observes as a consequence of 
its own actions is rich in causal structure. This, in turn, 
then allows the learner to construct a maximally predic- 
tive model of its environment. 

Optimally predictive model and decision making. - Let there be a 
physical system to be learned, and call it the learner's "world". A 
learner in parallel (i) builds a model of the world and (ii) engages in 
an interaction with the world. The learner's inputs are observations, 
x(t), of (some aspects of) the world. Observations result 
in actions, a(t), through a decision process. Actions affect 
the world and so change future observations. 

Let us assume that the learner interacts with the environment 
between consecutive observations. 2 Let one decision epoch consist 
in mapping the current "history", h (specified below), available to 
the learner at time t, onto an action (sequence) a that starts at time 
t and takes the time Δ to be executed. The next datum is sensed at 
time t + Δ. (We assume for simplicity that the times it takes to react 
and to sense are both negligible.) 

The decision function, or action policy [20], is given by the 
conditional probability distribution P(a|h). 3 Let the model summarize 
historical information, using internal states s, via the probabilistic 
map P(s|h). The model and the policy depend upon each other, but 
histories are mapped independently onto (i) internal states (using the 
model P(s|h)), and (ii) action sequences (using the policy P(a|h)). 
Hence, actions and internal states are conditionally independent, if 
the history h is given: 

$$ P(s,a|h) = P(s|h)\, P(a|h) . \tag{1} $$

1 Our approach is fairly general, and to compare one has to adopt 
the specific RL setting, which we explore in [23]. 

2 This sequential setup is useful for the sake of simplicity. However, 
a real agent continuously acts and senses, and an extension to 
this more involved case would be interesting. 

3 Shorthand notation: the argument t is dropped. Actions a, internal 
states s, futures z, and histories h are (possibly multi-valued) 
random variables with values A ∈ 𝒜, S ∈ 𝒮, Z ∈ 𝒵, and H ∈ ℋ, 
respectively. 

The "internal state" does not change the statistics of the 
environment, but rather serves as an internal observer. 
The feedback due to the actions, however, changes the 
statistics of the environment. The action policy contains 
a model in the sense that if a large group of histories share 
the same optimal action, then the action can be viewed as 
a compressed representation of this "history-cluster". 

The learner uses the current state, s(t), together with knowledge of 
the action, a(t), to make probabilistic predictions of future 
observations, z(t), of length τ_F: 4 

$$ P(z|s,a) = \frac{1}{P(s,a)} \left\langle P(z|h,a)\, P(a|h)\, P(s|h) \right\rangle_{P(h)} . \tag{2} $$

P(z|h,a) and P(h) are (for the moment) assumed to be known. A history 
always includes the current observation, x(t). Beyond this, it may 
include a record of prior observations reaching some length τ_P into 
the past, and also previous internal state and action(s). Lengths of 
the internal records of past observations and past actions are assumed 
given by the learner's storage capacity. 

The problem of interactive learning then is to choose 
a model and an action policy, which are optimal in that 
they maximize the learner's ability to predict the world, 
while being minimally complex. 

We measure the learner's predictive ability by the mutual 
information [7] that the internal state, in the presence of the action, 
contains about the future: 

$$ I[\{s,a\};z] = \left\langle \log \frac{P(z|s,a)}{P(z)} \right\rangle_{P(z,s,a)} . \tag{3} $$

The quantity I[{s,a};z] = H[z] − H[z|s,a] measures the reduction in 
the uncertainty about the future (entropy H), when state and action 
are known. It is zero if the future is independent of s and a. It is 
maximal if the knowledge of s and a eliminates all uncertainty about 
the future (H[z|s,a] = 0). 
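
The two limiting cases just described can be checked numerically. The following is a minimal sketch, assuming small discrete variables; the dictionary layout, the function name `predictive_information`, the toy distributions, and the use of base-2 logarithms (bits) are illustrative assumptions, not part of the original formulation.

```python
from math import log2

def predictive_information(joint):
    """I[{s,a};z] in bits, from a dict {(s, a, z): probability}."""
    p_sa, p_z = {}, {}
    for (s, a, z), p in joint.items():
        p_sa[(s, a)] = p_sa.get((s, a), 0.0) + p   # marginal P(s,a)
        p_z[z] = p_z.get(z, 0.0) + p               # marginal P(z)
    # <log P(z|s,a)/P(z)> = <log P(z,s,a) / (P(s,a) P(z))>, averaged over P(z,s,a)
    return sum(p * log2(p / (p_sa[(s, a)] * p_z[z]))
               for (s, a, z), p in joint.items() if p > 0)

# Hypothetical case 1: the future is fully determined by state and action (z = s XOR a).
deterministic = {(s, a, s ^ a): 0.25 for s in (0, 1) for a in (0, 1)}
# Hypothetical case 2: the future is independent of state and action.
independent = {(s, a, z): 0.125 for s in (0, 1) for a in (0, 1) for z in (0, 1)}

print(predictive_information(deterministic))  # equals H[z] = 1 bit
print(predictive_information(independent))    # 0 bits
```

In the first case all uncertainty about z is removed, so the information equals the entropy of the future; in the second case it vanishes.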

4 Future observations, z(t), are given by the signal x(t') on the 
interval t' ∈ [t + Δ, t + Δ + τ_F], where Δ is the duration of the 
intervention given by the action, or the sequence of actions, initiated 
at time t, a(t). The learner is interested in understanding how one 
intervention changes the future. The action choice does depend on 
past actions, if they are included in the learner's history, h(t). 
However, planning of consecutive future actions is not discussed here, 
but an extension would be desirable. The notation ⟨·⟩_P denotes the 
average taken over P. 
Simple models and simple action policies come at a lower coding 
cost, quantified by the coding rates I[s;h] and I[a;h], respectively. 
The notion that the simplest possible model is preferable is deeply 
rooted in our culture. William of Ockham is frequently cited on this 
matter, which is known as "Ockham's razor". In the same vein, out of 
two action policies which yield the same value of the objective, 
Eq. (3), one would choose the simpler policy, as there is no reason to 
implement a more complex policy which takes more memory. 

The interactive learning problem is solved by maximizing 
I[{s,a};z] over P(s|h) and P(a|h), under constraints that select for 
the simplest possible model and the most efficient policy, 
respectively, in terms of smallest complexity measured by the coding 
rate. Less complex models and policies result in less predictive 
power. This trade-off can be implemented using Lagrange multipliers, 
λ and μ. Following the spirit of rate distortion theory [7], and, more 
closely related, the information bottleneck method (IB) [6], one can 
then calculate the best possible solution at each value of the 
Lagrange multipliers. The optimization problem for interactive 
learning is given by: 

$$ \max_{P(s|h),\, P(a|h)} \Big( I[\{s,a\};z] - \lambda\, I[s;h] - \mu\, I[a;h] \Big) . \tag{4} $$



The two constraints are taken into account individually, 
rather than as a sum, 5 so that their relative importance 
can be adjusted. Think, for example, about a robotic 
multi-agent system in which robots communicate their in- 
ternal states to each other. Limited communication chan- 
nel capacity may force them to produce compact internal 
representations, but the complexity of the action policy 
that each individual can implement does not have to be 
equally constrained. 

The trade-off parameters λ and μ parameterize families of optimal 
models and policies, respectively, constituting those models and 
policies that have maximal predictive power at fixed complexity. An 
analogy to statistical mechanics is useful to guide intuition [5], and 
relates λ and μ to temperature - they control the "fuzziness" of the 
maps that assign histories to states and actions, respectively. This 
approach also relates the distortion function to the energy function 
of a corresponding physical system and the normalization constant to 
the partition function. 

5 I[{s,a};h] + I[s;a] = I[a;h] + I[s;h], because of Eq. (1). 
I[{s,a};h] is the coding rate of the learner's full behavior - consisting 
of both the internal state, s, and the action sequence, a. I[s;a] 
measures the redundancy, which should be minimized together with 
the coding rate. 

Optimal action policies. The action policies that solve the 
optimization problem, Eq. (4), are given by 

$$ P_{\rm opt}(a|h) = \frac{P(a)}{Z_A(h,\mu)}\, e^{-\frac{1}{\mu} E_A(a,h)} , \tag{5} $$

with the energy function 

$$ E_A(a,h) = \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(s|h)} - \mathcal{D}[P(z|h,a) \,\|\, P(z)] , \tag{6} $$

and the partition function 

$$ Z_A(h,\mu) = \left\langle e^{-\frac{1}{\mu} E_A(a,h)} \right\rangle_{P(a)} . \tag{7} $$

𝒟[p‖q] = ⟨log[p/q]⟩_p denotes the relative entropy, or 
Kullback-Leibler divergence, between distributions p and 
q. Equations (5)-(7) must be solved self-consistently, together 
with Eq. (2) and 

$$ P(a) = \left\langle P(a|h) \right\rangle_{P(h)} , \tag{8} $$

$$ P(z) = \left\langle \left\langle P(z|h,a) \right\rangle_{P(a|h)} \right\rangle_{P(h)} . \tag{9} $$


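To derive this result (Eqs. (5)-(7)), one calculates 
I[{s,a};z], using Eq. (2), and the functional derivative 
of Eq. (4) w.r.t. P(a|h). Individual nonzero contributions 
are given by: 6 

$$ \frac{\delta I[\{s,a\};z]}{\delta P(a|h)} = P(h) \left\langle \left\langle \log \frac{P(z|s,a)}{P(z)} \right\rangle_{P(z|h,a)} \right\rangle_{P(s|h)} = P(h)\, \mathcal{D}[P(z|h,a) \,\|\, P(z)] - P(h) \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(s|h)} , \tag{10} $$

$$ \frac{\delta I[a;h]}{\delta P(a|h)} = P(h) \log \frac{P(a|h)}{P(a)} . \tag{11} $$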
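
The step from Eqs. (10) and (11) to Eq. (5) can be spelled out as follows. This is only a sketch: the multiplier ν(h), which enforces the normalization of P(a|h) for each history, is notation introduced here and does not appear in the original, and P(h) > 0 is assumed. By Eqs. (6) and (10), the first contribution equals −P(h) E_A(a,h); the term λ I[s;h] does not depend on P(a|h). Setting the functional derivative of the objective in Eq. (4), plus the normalization term, to zero gives

$$ -P(h)\, E_A(a,h) - \mu\, P(h) \log \frac{P(a|h)}{P(a)} + \nu(h) = 0 , $$

so that

$$ P(a|h) = P(a)\, \exp\!\left[ -\frac{1}{\mu} E_A(a,h) \right] \exp\!\left[ \frac{\nu(h)}{\mu\, P(h)} \right] . $$

The second exponential does not depend on a. Requiring that P(a|h) sums to one over a fixes it to 1/Z_A(h,μ), with Z_A as in Eq. (7), which yields Eq. (5).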



Observe that the most likely action is that of minimum 
energy (see Eq. (5)). The first term in the energy function, 
Eq. (6), 

$$ \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(s|h)} \tag{12} $$

is smaller for actions that will, on average, make the conditional 
future distribution P(z|h,a) as similar as possible to the 
distribution that is predicted by the learner's internal state, 
P(z|s,a). The average is taken over the model P(s|h). This term 
selects for actions that bias the future towards what the learner 
predicts - it is therefore related to the control that the learner can 
exert on the world. The second (negative) term in Eq. (6) 

$$ -\mathcal{D}[P(z|h,a) \,\|\, P(z)] \tag{13} $$

selects for actions that will make the conditional future 
distribution P(z|h,a) as different as possible from the average 
P(z). The term embodies a preference for actions that bias towards 
an uncommon future distribution - it is related to exploration and 
causes the learner to perturb the world away from the average. 



6 Terms constant in a are omitted, because in the solution they 
are absorbed into Z_A. 
This shows that at the root of interactive learning there 
is a competition between exploration and control, which 
arises as a fundamental consequence of the proposed op- 
timization principle: Exploration and control have to be 
balanced in the optimal action policy to result in maximal 
predictive power. 

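Optimally predictive models. The family of models 
that solve the optimization problem, Eq. (4), is given by 7 

$$ P_{\rm opt}(s|h) = \frac{P(s)}{Z_S(h,\lambda)}\, e^{-\frac{1}{\lambda} E_S(s,h)} , \tag{14} $$

with 

$$ E_S(s,h) = \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(a|h)} \tag{15} $$

and 

$$ Z_S(h,\lambda) = \left\langle e^{-\frac{1}{\lambda} E_S(s,h)} \right\rangle_{P(s)} . \tag{16} $$

These equations must be solved self-consistently, together 
with Eq. (2) and 

$$ P(s) = \left\langle P(s|h) \right\rangle_{P(h)} . \tag{17} $$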



The most likely state minimizes the relative entropy between the 
actual, P(z|h,a), and the predicted, P(z|s,a), conditional future 
distribution (see Eqs. (14) and (15)), averaged over the action policy 
P(a|h). The internal states thus capture the effect that the history 
has on the probability distribution over futures, under a given action 
policy. In that sense, the optimal model reflects the causal structure 
of the underlying process. 

Altogether, Eqs. (5) and (14) must be solved self-consistently 
(together with Eqs. (2), (6)-(9), and (15)-(17)) to yield the model 
that is optimally predictive under the optimal policy (and vice versa). 
This can be done iteratively, resulting in an algorithm that is similar 
to the IB algorithm [6]. This new algorithm, however, includes a 
feedback loop, due to actions. 8 
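
One way to picture such an iteration is the fixed-point sketch below. It is not the algorithm of [24]: the simultaneous update order, the fixed number of sweeps, the nested-list data layout, and the toy world are all assumptions made for illustration, and the code presumes that every distribution stays strictly positive.

```python
from math import exp, log

def kl(p, q):
    """Relative entropy D[p||q] in nats; entries with p_i = 0 contribute zero."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def iterate(Pzha, Ph, Psh, Pah, lam, mu, sweeps=50):
    """Repeat simultaneous updates of the model P(s|h) and the policy P(a|h)."""
    H, S = range(len(Ph)), range(len(Psh[0]))
    A, Z = range(len(Pah[0])), range(len(Pzha[0][0]))
    for _ in range(sweeps):
        # Marginals, Eqs. (8), (17), and (9).
        Pa = [sum(Ph[h] * Pah[h][a] for h in H) for a in A]
        Ps = [sum(Ph[h] * Psh[h][s] for h in H) for s in S]
        Pz = [sum(Ph[h] * Pah[h][a] * Pzha[h][a][z] for h in H for a in A) for z in Z]
        # Predicted futures P(z|s,a), Eq. (2); assumes P(s,a) > 0 everywhere.
        Pzsa = [[None for _ in A] for _ in S]
        for s in S:
            for a in A:
                w = [Ph[h] * Psh[h][s] * Pah[h][a] for h in H]
                Pzsa[s][a] = [sum(w[h] * Pzha[h][a][z] for h in H) / sum(w) for z in Z]
        # div[h][s][a] = D[P(z|h,a) || P(z|s,a)].
        div = [[[kl(Pzha[h][a], Pzsa[s][a]) for a in A] for s in S] for h in H]
        # Energies, Eqs. (6) and (15), both computed from the current maps.
        EA = [[sum(Psh[h][s] * div[h][s][a] for s in S) - kl(Pzha[h][a], Pz)
               for a in A] for h in H]
        ES = [[sum(Pah[h][a] * div[h][s][a] for a in A) for s in S] for h in H]
        # Updates of Eqs. (5) and (14); dividing by the row sum plays the role of Z.
        Pah = [[Pa[a] * exp(-EA[h][a] / mu) for a in A] for h in H]
        Psh = [[Ps[s] * exp(-ES[h][s] / lam) for s in S] for h in H]
        Pah = [[x / sum(row) for x in row] for row in Pah]
        Psh = [[x / sum(row) for x in row] for row in Psh]
    return Psh, Pah

# Toy world (hypothetical): action 0 reveals which of two groups a history belongs to,
# action 1 yields a coin flip regardless of the history.
informative = {0: [0.9, 0.1], 1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.1, 0.9]}
Pzha = [[informative[h], [0.5, 0.5]] for h in range(4)]
Ph = [0.25] * 4
Psh0 = [[0.6, 0.4], [0.55, 0.45], [0.4, 0.6], [0.45, 0.55]]  # slightly asymmetric start
Pah0 = [[0.5, 0.5] for _ in range(4)]
model, policy = iterate(Pzha, Ph, Psh0, Pah0, lam=0.1, mu=0.5)
```

In this toy setting the iteration settles on a model that separates the two groups of histories and on a policy that prefers the informative action. The asymmetric starting point matters: a model that starts out identical for all histories stays that way under these updates.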

With increasing λ, the level of abstraction of the model increases, 
as less detail is kept. In the high temperature limit, λ → ∞, all 
possible histories are effectively represented by the same internal 
state. 9 



7 The derivation is similar to that for Eq. (5) and follows [6]. 
Individual contributions to the functional derivative w.r.t. P(s|h) 
are (ignoring constant terms): 

$$ \frac{\delta I[\{s,a\};z]}{\delta P(s|h)} = -P(h) \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(a|h)} $$

and 

$$ \frac{\delta I[s;h]}{\delta P(s|h)} = P(h) \log \frac{P(s|h)}{P(s)} . $$

8 Details about the algorithm are given in [24], where examples 
are also discussed. An extension will be published elsewhere. 

9 As λ → ∞, P(z|s,a) is the same for all states s: As λ → ∞, 
P_opt(s|h) → P(s), see Eq. (14), and with that 
P(s,a) = ⟨P(s,a|h)⟩_{P(h)} = ⟨P(s|h) P(a|h)⟩_{P(h)} → P(s) P(a), and 
P(z|s,a) → [1/(P(s) P(a))] ⟨P(z|h,a) P(a|h) P(s)⟩_{P(h)} = P(z|a), ∀s; 
see Eqs. (1) and (2). 



Deterministic models and decisions. In the low temperature limit 
(T → 0; T ∈ {λ, μ}), the distributions in Eqs. (5) and (14) become 
deterministic mappings. To see this, let us use the discrete random 
variable y ∈ {a, s}, and let E(y,h) denote the value of the energy 
function E_A, if y = a, and E_S, if y = s. Furthermore, define the 
functions y*(h) := argmin_y E(y,h) and ℰ(y,h) := E(y,h) − E(y*(h),h), 
which is positive for all y ≠ y*(h) if the minimum is unique. Now, we 
can write the conditional distribution for the most likely value 
y*(h) as 

$$ P(y = y^*(h)|h) = \frac{P(y = y^*(h))}{Z(h,T)}\, e^{-\frac{1}{T} E(y^*(h),h)} = \left[ 1 + \sum_{y \neq y^*(h)} \frac{P(y)}{P(y^*(h))}\, e^{-\frac{1}{T} \mathcal{E}(y,h)} \right]^{-1} . \tag{18} $$

Since ℰ(y,h) is positive, the sum goes to zero as T → 0 (assuming 
that P(y*(h)) > 0). As a consequence, we have P(y = y*(h)|h) = 1 and, 
due to normalization, the optimal mapping becomes deterministic: 
P_{T→0}(y|h) = δ_{y y*(h)}, where δ denotes the Kronecker delta. 
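
The limit can be illustrated numerically. The sketch below evaluates a map of the form of Eqs. (5) and (14) for one fixed history at several temperatures; the prior, the energies, and the function name `boltzmann_map` are hypothetical choices made for illustration.

```python
from math import exp

def boltzmann_map(prior, energies, temperature):
    """P(y|h) = P(y) exp(-E(y,h)/T) / Z(h,T) for one fixed history h."""
    e_min = min(energies)
    # Subtracting the minimum energy leaves the normalized result unchanged
    # and keeps the exponentials from overflowing at small T.
    weights = [p * exp(-(e - e_min) / temperature) for p, e in zip(prior, energies)]
    z = sum(weights)
    return [w / z for w in weights]

prior = [0.2, 0.5, 0.3]        # hypothetical P(y)
energies = [0.4, 0.1, 0.7]     # hypothetical E(y,h); the minimum is at y = 1

for temperature in (1e6, 1.0, 0.1, 0.01):
    print(temperature, boltzmann_map(prior, energies, temperature))
```

At high temperature the map reproduces the prior P(y), so the history is ignored; as the temperature decreases, the probability concentrates on the minimum-energy value, as stated by Eq. (18).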

For a deterministic model, specified by P_{λ→0}(s|h) = δ_{s s*(h)}, 
this means that a history h is assigned with probability one to the 
state s = s*(h) which minimizes the energy function E_S(s,h), Eq. (15): 

$$ s^*(h) = \arg\min_s \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(a|h)} . \tag{19} $$

Note that without constraints on the cardinality of the 
state space, one can always ensure that this minimum is 
zero: E_S(s*(h),h) = 0. This fact then implies that the 
predicted information, I[{s,a};z], reaches its maximum 
at the optimal deterministic model, I[{s*,a};z]. 

The maximum is given by the predictive information 
of the time series, in the presence of the learner's actions: 
I[{s*,a};z] = I[{h,a};z]. 10 The optimal policy now maximizes 
this quantity, at fixed I[a;h]. This illustrates that 
the optimal policy makes as much information as possible 
available to be summarized by the model, at fixed policy 
complexity. 

Action policies become increasingly random with increasing μ - the 
learner's reactions become less specific responses to the history. In 
the other limit, as the complexity constraint is relaxed by letting the 
parameter μ approach zero, one finds the optimal deterministic policy 
a*(h) 11 which maximizes the predictive information of the 
time series, in the presence of the actions. 

10 I[{s,a};z] = I[{h,a};z] − ⟨E_S(s,h)⟩_{P(s,h)}. The second 
term vanishes for the optimal deterministic model. It becomes 
⟨E_S(s*(h),h)⟩_{P(h)} = 0. 

11 a*(h) is given by Eq. (21). 

The special case is of particular interest in which the 
learner produces deterministic maps s*(h) and a*(h), 
which maximize the predictive power, Eq. (3). The optimal 
deterministic model maps a history h to the internal 
state 

$$ s^*(h) := \arg\min_s \mathcal{D}[P(z|h,a^*(h)) \,\|\, P(z|s,a^*(h))] . \tag{20} $$

Assuming that there are no constraints on the cardinality 
of the state space, this map partitions the space of histories 
in a way that is similar to the causal state partition of 
[10]. One can show [8] that if actions are not considered 
(passive time series modeling), then the passive equivalent 
of Eq. (20) exactly recovers the causal state partition 
of [10]. Causal states are unique and minimal sufficient 
statistics - constituting a meaningful representation of the 
underlying process [9]. 

The partition specified by Eq. (20) allows for an extension of the 
causal state concept to interactive time series modeling: here the 
space of histories is partitioned such that all histories, 
h ∈ ℋ_s ⊂ ℋ, that are mapped to the same causal state, s, are causally 
equivalent under the optimal action policy, a*(h); meaning that their 
conditional future distributions P(z|h, a*(h)) are the same. 

This grouping of histories results in an equivalence class that is 
controlled by the action policy: under any action policy, A(h) (where 
the map A : h ↦ a is a deterministic policy), two histories h and h' 
are equivalent with respect to their effect on the future, z, if 
P(z|h, A(h)) = P(z|h', A(h')). The resulting partition, S_A, of the 
history space into causal states depends on the action policy, A. The 
choice of the policy determines the nature of the time series which is 
produced by the system coupled to the observer through the actions. 
Note that there could be different action policies A' ≠ A, which result 
in coupled systems with the same underlying causal state partition 
S_A = S_{A'}. The policy A = a* is the deterministic policy that 
creates the coupled world-observer system that can be predicted most 
effectively by a causal model. 
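
The dependence of the partition on the policy can be sketched with a small example. The history labels, action names, and probabilities below are hypothetical, and comparing distributions after rounding is an implementation convenience assumed here; the text itself requires exact equality.

```python
def causal_partition(future_given_h_a, policy, decimals=12):
    """Group histories h by P(z | h, A(h)) for a deterministic policy A: h -> a."""
    classes = {}
    for h, a in policy.items():
        # Histories with the same conditional future distribution share a key.
        key = tuple(round(p, decimals) for p in future_given_h_a[(h, a)])
        classes.setdefault(key, []).append(h)
    return sorted(sorted(members) for members in classes.values())

# Hypothetical world: three histories, two actions ("L", "R"), binary future z.
world = {
    ("00", "L"): [0.9, 0.1], ("00", "R"): [0.5, 0.5],
    ("01", "L"): [0.9, 0.1], ("01", "R"): [0.2, 0.8],
    ("10", "L"): [0.1, 0.9], ("10", "R"): [0.5, 0.5],
}
always_left = {h: "L" for h in ("00", "01", "10")}
always_right = {h: "R" for h in ("00", "01", "10")}

print(causal_partition(world, always_left))   # [['00', '01'], ['10']]
print(causal_partition(world, always_right))  # [['00', '10'], ['01']]
```

The same world yields different causal state partitions under the two policies, which is the sense in which the equivalence classes are controlled by the action policy.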

Optimal deterministic decisions for actions are made according 
to the rule 

$$ a^*(h) := \arg\min_a \Big[ \left\langle \mathcal{D}[P(z|h,a) \,\|\, P(z|s,a)] \right\rangle_{P(s|h)} - \mathcal{D}[P(z|h,a) \,\|\, P(z)] \Big] . \tag{21} $$

It is important to note that the term related to exploration 
(second term) persists in the optimal deterministic action 
policy, Eq. (21). This is in direct contrast to "Boltzmann 
exploration", commonly used in RL [20]. There, explo- 
ration is implemented as policy randomization by soften- 
ing of the optimal, deterministic policy (optimal in an RL 
sense by maximizing expected future reward). We have 
shown here, however, that to create data which allows for 
optimally predictive modeling, an exploratory component 
must be present even in the optimal deterministic policy. 
In our framework, exploration is hence an emerging be- 
havior, and it is not the same as policy randomization. 



Probability estimates and finite sampling errors. So far, we have 
assumed P(z|h,a) and P(h) to be known. However, in practice, they may 
have to be estimated from the observed time series. Hence there could 
be a bias towards overestimating I[{s,a};z] due to finite sampling 
errors in the probability estimates. This may result in over-fitting. 
The accuracy of the estimates depends on the data set size, N. One can 
counteract finite sampling errors, using an approximate error 
correction method, such as discussed in [25]. This method has already 
been applied successfully to predictive inference in the absence of 
actions [8] and it can also be applied in the presence of actions. 

Time dependent on-line learning procedure. In [25], we calculated 
bounds on the smallest temperature, T*(N), allowable before 
over-fitting occurs. This value depends on the data set size N. In the 
interactive learning setup, the data set size grows linearly with time. 
One can implement an algorithmic annealing procedure, similar to the 
one in [5], but different in that the temperature is kept fixed at each 
time step and then changes over time with growing data set size. This 
captures the intuition that a learner may allow itself to model an 
increasing amount of detail the longer it has observed the world. The 
temperatures in each time step are set to (upper bounds on) the values 
λ*(t) and μ*(t), below which over-fitting would occur. Since these can 
be calculated, an annealing rate, as used in deterministic annealing 
[5], is not necessary. The work in [25] directly provides a bound on 
λ*(t) and could be extended to calculate a bound on μ*(t). Tighter 
bounds or an exact calculation of T* would also be desirable. 

Possible extension to multi-agent systems. When mul- 
tiple agents observe and interact with an environment, 
they often exhibit emerging co-operative behavior. Under- 
standing the emergence of such co-operative strategies is 
an active field of research. In order to utilize our approach 
for the study of this phenomenon, we have to distinguish 
(i) the agents' available sensory input and (ii) whether 
there is communication between agents. In the simplest 
case, each of the agents has access only to data from the 
environment. Then each agent can be modeled exactly as 
we have outlined here, and all coupling happens implic- 
itly, through the environment. Communication of inter- 
nal states and/or the observation (or communication) of 
each other's actions, however, means that the other agents' 
internal states and/or actions, respectively, must be in- 
cluded in each agent's input (history h). Furthermore, if 
agents try to learn about each other's behavior, then we 
need to include the other agents' future actions into the 
data which ought to be predicted (future z). A detailed 
exploration of multi-agent learning has to be left for future 
research. 

Summary. - This paper has proposed an 
information-theoretic approach to a quantitative understanding 
of interactive learning and adaptive behavior 
by means of optimal predictive modeling and decision 
making. A simple optimization principle was stated: use 
the least complex model and action policy which together 
provide the learner with the largest predictive ability. 
A fundamental consequence of this principle is that the 
optimal action policy finds a balance between exploration 
and control. This is a direct consequence of optimal 
prediction in the presence of feedback due to the learner's 
actions. The theory developed here is general in that it 
makes no assumptions about the detailed structure of the 
underlying process that generates the data, and thus is 
not restricted to specific model classes. 